Dear Apertiumers, dear Gabriel:
thanks a lot for this information. This adds to a discussion we had in
the channel before, and there are already some outstanding proposals to
solve this which I haven't been able to pay enough attention to,
absorbed as I was by Google Code-In tasks.
We should look closely into this problem (and check how other systems
deal with it), as it is a crucial one.
Thanks again
Mikel
On 01/05/2014 12:34 PM, Gabriel Esteban Gullón wrote:
I am writing to you because, while I was working on one of the GCI
tasks, mlforcada and I discovered that it is very difficult to handle
format perfectly in Apertium with the current engine. In the following
lines I will explain my task, the procedure I used to try to solve it,
and the particular error.
Task URL:
http://www.google-melange.com/gci/task/view/google/gci2013/4560097506230272
The task consisted of finding a transfer rule that messes up
word-processor format. The first thing one needs to do for this task is
to find a file whose word-processor format gets messed up when it is
translated with apertium. Then you need to isolate the problem and try
to arrive at the minimal file that causes it. When you find one of
these format errors, which are probably caused by the superblanks being
in the wrong order, you need to find the rule that causes the wrong
order and repair it.
The first part of the task (finding an example), which I did in
another task, only consisted of finding an example and uploading it
to the Apertium svn repository. The file that I chose for this task is
the one uploaded under the name "file6.odt".
First task URL:
http://www.google-melange.com/gci/task/view/google/gci2013/6154303467159552
Svn URL:
http://sourceforge.net/p/apertium/svn/HEAD/tree/trunk/apertium-en-ca/dev/odt-tests/
file6.odt:
<http://sourceforge.net/p/apertium/svn/HEAD/tree/trunk/apertium-en-ca/dev/odt-tests/file6.odt>
The following line shows the first example that I found. Note that it
contains two different formats: text that isn't typeset in italics and
text that is.
officer's /home/ head
These two different formats are written with tags, as in HTML;
therefore, our problem can be expressed as:
<a>officer's<b>home<c>head<d>
To find more examples, we can run
echo "<a>man's<b>green<c>work<d>" | apertium -f html-noent -d . en-ca
and then try other words and structures.
Once we have some examples, we need to guess a rule that lets us
reproduce the problem in similar cases.
When we run apertium with the above command, we see that the tags in
the output are reordered, as the following examples show:
$ echo "<a>man's<b>green<c>work<d>" | apertium -f html-noent -d . en-ca
<a>La feina<c>verda de l'home.<b><d>
$ echo "<a>developers'<b>main<c>chief<d>" | apertium -f html-noent -d . en-ca
<a>El cap<c>principal dels<b>desenvolupadors<d>
$ echo "<a>developer's<b>house<c>chief<d>" | apertium -f html-noent -d . en-ca
<a>El cap<c>de casa del<b>desenvolupador<d>
$ echo "<a>officer's<b>home<c>head<d>" | apertium -f html-noent -d . en-ca
<a>El cap<c>de casa de l'agent.<b><d>
$ echo "<a>man's<b>chair<c>work<d>" | apertium -f html-noent -d . en-ca
<a>La feina<c>de cadira de l'home.<b><d>
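The reordering in these examples can also be checked mechanically. The following is only a minimal sketch in Python, not part of Apertium: the helper names tag_sequence and tags_reordered are made up for illustration, and it assumes that every format tag looks like <...> as in the examples above.

```python
import re

# Assumption: a format tag is anything of the form <...>, as in the examples above.
TAG = re.compile(r"<[^>]+>")

def tag_sequence(text):
    """Return the format tags in the order in which they appear in the text."""
    return TAG.findall(text)

def tags_reordered(source, translation):
    """True if the translation keeps the same tags but in a different order."""
    src, out = tag_sequence(source), tag_sequence(translation)
    return sorted(src) == sorted(out) and src != out

# Input/output pairs copied from the examples above.
examples = [
    ("<a>man's<b>green<c>work<d>", "<a>La feina<c>verda de l'home.<b><d>"),
    ("<a>officer's<b>home<c>head<d>", "<a>El cap<c>de casa de l'agent.<b><d>"),
]
for source, translation in examples:
    print(tag_sequence(translation), tags_reordered(source, translation))
```

For both pairs the tags come out in the order <a>, <c>, <b>, <d>, so the check reports a reordering.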
Now that we have found more examples, we can guess the rule that we
are looking for. This specific problem happens in a phrase like:
[noun + possessive s] + [adj] + [noun]
To simplify this, I will use the notation [A's] + [B] + [C].
In all of the examples that show this problem, both [B] and [A's]
complement [C].
To find out at which stage this happens, we run apertium with the
modes en-ca-chunker, where the tags turn out not to be moved around,
and en-ca-interchunk. We find that the problem appears in the
interchunk, so, if the problem is caused by a transfer rule, that rule
will be in the .t2x file.
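Running the same input through each mode in turn can be scripted. This is a rough sketch in Python under some assumptions: the function names are hypothetical, the command line is the one used in this message (apertium -f html-noent -d . MODE), and the three modes are assumed to be available in the current language-pair directory.

```python
import subprocess

# Modes named in this message, from the earliest stage to the full pipeline.
MODES = ["en-ca-chunker", "en-ca-interchunk", "en-ca"]

def build_command(mode, data_dir="."):
    """Build the apertium command line used in this message for one mode."""
    return ["apertium", "-f", "html-noent", "-d", data_dir, mode]

def run_mode(text, mode, data_dir="."):
    """Feed the text to apertium on standard input and return its output."""
    result = subprocess.run(
        build_command(mode, data_dir),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

def outputs_by_mode(text, modes=MODES):
    """Run the text through every mode so that the outputs can be compared."""
    return {mode: run_mode(text, mode) for mode in modes}
```

Calling outputs_by_mode("<a>man's<b>green<c>work<d>") from the apertium-en-ca directory would then give one output per stage, and the first stage whose tags are out of order is where the problem is introduced.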
Going further into the problem, mlforcada told me to analyse the
following simplified sentence: "green<AAA>man</AAA>'s house". When we
translate this using
echo "green<AAA>man's</AAA>house" | apertium -f html-noent -d . en-ca
we get the same kind of result as in the other cases.
mode: en-ca
input: green<AAA>man</AAA>'s house
output: La casa de l'home</AAA><AAA>verd
To find out at which step of apertium this problem appears, we change
the mode. The first mode that we try is en-ca-chunker.
mode: en-ca-chunker
input: green<AAA>man</AAA>'s house
output:
^Nom_adj<SN><UNDET><m><sg>{^home<n><3><4>$<AAA>^verd<adj><3><4>$}$
^pr<GEN>{}$</AAA>^nom<SN><UNDET><f><sg>{^casa<n><3><4>$}$^punt<sent>{^.<sent>$}$
We can see four different chunks:
* ^Nom_adj<SN><UNDET><m><sg>{^home<n><3><4>$*<AAA>*^verd<adj><3><4>$}$
* ^pr<GEN>{}$
* ^nom<SN><UNDET><f><sg>{^casa<n><3><4>$}$
* ^punt<sent>{^.<sent>$}$
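The chunks that carry a format tag inside them can also be found automatically. The following Python sketch is only an illustration: it assumes the stream looks exactly as printed in this message (chunks written as ^name<tags>{...}$ and format tags appearing bare between lexical units), and the function name tags_inside_chunks is made up.

```python
import re

# Assumption: a chunk is ^name<tag>...{contents}$, as printed above.
CHUNK = re.compile(r"\^([^<{^$]+)(?:<[^>]*>)*\{(.*?)\}\$")
# A lexical unit inside a chunk, for example ^home<n><3><4>$.
LEXICAL_UNIT = re.compile(r"\^[^$]*\$")
TAG = re.compile(r"<[^>]+>")

def tags_inside_chunks(stream):
    """Return (chunk name, tags) for each chunk with format tags inside it."""
    found = []
    for match in CHUNK.finditer(stream):
        name, inner = match.group(1), match.group(2)
        # Whatever is left after removing the lexical units is format material.
        leftover = LEXICAL_UNIT.sub("", inner)
        tags = TAG.findall(leftover)
        if tags:
            found.append((name, tags))
    return found

# Chunker output copied from this message.
stream = (
    "^Nom_adj<SN><UNDET><m><sg>{^home<n><3><4>$<AAA>^verd<adj><3><4>$}$ "
    "^pr<GEN>{}$</AAA>^nom<SN><UNDET><f><sg>{^casa<n><3><4>$}$"
    "^punt<sent>{^.<sent>$}$"
)
print(tags_inside_chunks(stream))  # [('Nom_adj', ['<AAA>'])]
```

Only the Nom_adj chunk is reported. The </AAA> tag is not, because it sits between chunks rather than inside one.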
If we read this carefully, we see that the tag <AAA> is included
inside the first chunk. As a consequence, when this text is passed on
to the interchunk, the interchunk can't do anything to correct the
phrase and remove the <AAA> that's inside the chunk. We can see this
in the output of the interchunk mode, shown in the following lines.
mode: en-ca-interchunk
input: green<AAA>man</AAA>'s house
output: ^Nom<SN><PDET><f><sg>{^casa<n><3><4>$}$
^pr<PREP>{^de<pr>$}$</AAA>^nom_adj<SN><PDET><m><sg>{^home<n><3><4>$<AAA>^verd<adj><3><4>$}$^punt<sent>{^.<sent>$}$
As we can see, the first chunk, which contains "green<AAA>man's", is
moved after the closing tag. So the chunk that contains the <AAA> tag
now comes after the </AAA> tag, and this is what makes the tagging
invalid. As the tag is inside the chunk, the interchunk can't do
anything to change the position of the tag, so it's impossible to
solve this problem without changing the engine.
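Whether a given output is still validly tagged can be tested with a simple stack. This is a minimal Python sketch, not something Apertium provides: the function name is_well_nested is hypothetical, and it only handles paired tags such as <AAA> ... </AAA> (self-closing tags are not treated specially).

```python
import re

# Captures an optional "/" (closing tag) and the tag name.
TAG = re.compile(r"<(/?)([^<>/\s]+)[^<>]*>")

def is_well_nested(text):
    """True if every closing tag matches the most recent open tag."""
    stack = []
    for closing, name in TAG.findall(text):
        if closing:
            # A closing tag must match the tag opened most recently.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    # Any tag still open at the end is also an error.
    return not stack

print(is_well_nested("green<AAA>man</AAA>'s house"))       # True
print(is_well_nested("La casa de l'home</AAA><AAA>verd"))  # False
```

The input passes the check, while the translated output fails at once, because </AAA> arrives before any <AAA> has been opened.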
The conclusion is that the task is impossible to complete, because this
problem can't be solved by changing a rule: it's a major problem in the
engine.
/Gabriel Esteban (@galaxyfeeder)/
/5th January 2014/
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326