Yet another option, although somewhat computationally intensive, is the following (it works with ANY MT system):

  1. Split the source sentence into all possible contiguous segments
     of 1, 2, 3... words (for N words, there would be N(N+1)/2 segments).
  2. Translate those segments with Apertium.
  3. Look for their translation in the Apertium translation of the
     whole source sentence.
  4. Annotate alignments.
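
A minimal Python sketch of the four steps above, in case it helps. The names (segments, find_sub, align, apertium_translate) are mine, not part of Apertium. It assumes whitespace tokenisation, and the apertium_translate helper assumes an `apertium` binary on the PATH and a compiled en-ca pair in the current directory; any other function that maps a string to its translation can be passed instead.

```python
import subprocess


def segments(words):
    """Yield (start, end) for every contiguous segment: N(N+1)/2 in total."""
    n = len(words)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield start, end


def find_sub(haystack, needle):
    """Return every index at which the word list `needle` occurs in `haystack`."""
    k = len(needle)
    return [i for i in range(len(haystack) - k + 1) if haystack[i:i + k] == needle]


def align(source, translate):
    """Return ((src_start, src_end), (tgt_start, tgt_end)) pairs for exact matches.

    `translate` is any callable mapping a source string to a target string.
    Indices are word positions; the end index is exclusive.
    """
    src = source.split()
    tgt = translate(source).split()
    alignments = []
    for start, end in segments(src):
        seg_tgt = translate(" ".join(src[start:end])).split()
        if not seg_tgt:
            continue  # segment translated to nothing, so there is nothing to look for
        for pos in find_sub(tgt, seg_tgt):
            alignments.append(((start, end), (pos, pos + len(seg_tgt))))
    return alignments


def apertium_translate(text, pair="en-ca", datadir="."):
    """Assumption: translate by piping text through the apertium command."""
    result = subprocess.run(
        ["apertium", "-d", datadir, pair],
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Calling align(sentence, apertium_translate) runs one translation per segment plus one for the whole sentence, which is where the cost comes from. A repeated word shows up as several entries with the same source span.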

This would only give alignments where the matches are exact; therefore, transformations such as agreement or reordering would "delete" word-to-word alignments, and you would be left with alignments between larger segments only.

There would also be a risk of multiple alignments when words are repeated, etc., but Fran's suggestion of fake XML tags carrying position numbers would help there.

Mikel



On 07/24/2011 06:29 PM, Francis Tyers wrote:
On Sun, 24 Jul 2011 at 12:21 -0400, Hector Villafuerte wrote:
[...]
Yes it is possible, but I don't know of anyone interested in doing it.
If you want some tips on how I'd do it, you can respond here.

Fran


Yes, please :)

You can try using fake XML tags:



$ cat > /tmp/foo

<w pos="1">This</w>  <w pos="2">is</w>  <w pos="3">a</w>  <w pos="4">big</w>  <w pos="5">house</w>  <w pos="6">.</w>

$ cat /tmp/foo | apertium -d . -f html en-ca

<w pos="1">Això</w>  <w pos="2">és</w>  <w pos="3"></w>  <w pos="4">una casa</w>  <w pos="5">gran</w>  <w pos="6">.</w>

or:

$ cat /tmp/foo
<w pos="1"/>This<w pos="2"/>is<w pos="3"/>a<w pos="4"/>big<w pos="5"/>house<w pos="6"/>.

$ cat /tmp/foo | apertium -d . -f html en-ca
<w pos="1"/>Això<w pos="2"/>és<w pos="3"/>  <w pos="4"/>una casa<w pos="5"/>gran<w pos="6"/>.

The problem is that in some pairs, superblanks are reordered and merged,
so you might lose some info.
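
For the first variant above (paired <w> tags), here is a rough Python sketch of how the tagged input could be generated and how the position numbers could be read back from the translated output. The function names tag_words and read_alignment are made up for illustration, and it assumes whitespace tokenisation and that the <w> tags come through the translation intact.

```python
import re

# Matches <w pos="N">text</w>, allowing a line break between "w" and "pos".
W_TAG = re.compile(r'<w\s+pos="(\d+)">(.*?)</w>', re.DOTALL)


def tag_words(sentence):
    """Wrap each whitespace-separated token in a numbered fake XML tag."""
    return " ".join(
        f'<w pos="{i}">{word}</w>'
        for i, word in enumerate(sentence.split(), start=1)
    )


def read_alignment(translated):
    """Map each source position to the target text found inside its tag."""
    return {
        int(pos): " ".join(text.split())  # normalise internal whitespace
        for pos, text in W_TAG.findall(translated)
    }
```

Run on the Catalan output shown earlier, read_alignment gives an empty string for position 3 and "una casa" for position 4, so the numbers record where the tags ended up rather than a guaranteed word-to-word correspondence, which is the caveat about reordered and merged superblanks.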

Another thing you could do is to append a numbering tag to each LU
(lexical unit) after the tagger, e.g.

^This<prn><tn><mf><sg><#1>$ ^be<vbser><pri><p3><sg><#2>$
^a<det><ind><sg><#3>$ ^big<adj><sint><#4>$
^house<n><sg><#5>$^.<sent><#6>$

But then you would need to edit the transfer files of all the pairs to
print these out. Also, you would need to remove them before generation.
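
A small Python sketch of the numbering-tag idea, operating on the tagger output as text. The function names number_lus and strip_numbers are hypothetical, and the regular expression is a simplification: it assumes every LU looks like ^...$ and ignores escaped characters and superblanks that might contain ^ or $. It does not touch the transfer files, which would still need editing as described.

```python
import re

# Simplified LU pattern: everything between ^ and the next $.
LU = re.compile(r'\^(.*?)\$', re.DOTALL)


def number_lus(stream):
    """Append <#1>, <#2>, ... to successive lexical units in the stream."""
    counter = 0

    def add_number(match):
        nonlocal counter
        counter += 1
        return f'^{match.group(1)}<#{counter}>$'

    return LU.sub(add_number, stream)


def strip_numbers(stream):
    """Remove the numbering tags again, e.g. just before generation."""
    return re.sub(r'<#\d+>', '', stream)
```

number_lus would sit between the tagger and transfer, and strip_numbers just before the generator, after the numbers have been recorded.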

You could also try hacking the transfer to add a superblank before each
LU in addition to the existing superblanks that come in. So e.g. get it
to print out [@pos]^ every time it prints out a '^' from an LU.

Those are my ideas for now, let me know if you have any other questions.

Fran


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
