Respected mentors and Apertium community,
My name is Binay Neekhra.
I am interested in doing the 'Apertium assimilation evaluation toolkit' idea as a
GSoC project.

I have attempted the coding challenge. You may have a look at the
SourceForge link:
sourceforge.net/projects/basicassimilationtoolkit/files/?source=navbar

In brief:
For this task, I have taken the Basque-English pair (thanks to Mr. Tyres for
suggesting this). I have written a basic Python program, basicToolkit.py.

There are 6 files:
1. README file: codepad.org/iEO8PkWa
2. 'Source Sentences.txt': codepad.org/HwRIZsLx
    Contains Basque sentences, taken from the news site berria.info.
3. 'Apertium Translation.txt': codepad.org/cL8JXY6L
    Contains the Apertium eu-en machine translation output of
    'Source Sentences.txt'.
4. 'Reference Translation.txt': codepad.org/eZqNmlMK
    Contains the Google Translate output of the source sentences. This
    output is used for evaluation purposes.
5. basicToolkit.py: codepad.org/a7KAC7U2
    Takes sentences from 'Source Sentences.txt', 'Apertium Translation.txt',
    and 'Reference Translation.txt' and, based on the hint level chosen by
    the user, shows the relevant hints and asks the user to complete the
    cloze test. The responses are recorded in a separate file
    (userOutput.txt).
6. userOutput.txt: codepad.org/mN7m8H0x
    Contains the output of the assimilation evaluation performed on the
    above files: user input, % of holes successfully filled, % of blanks
    left, reference sentences, hint level, and a few other details.

I have read the paper suggested by Prof. Forcada ('Peeking through the
language barrier: the development of open-source gisting system for
Basque to English based on apertium.org'), along with H. Somers and E. Wild's
paper 'Evaluating Machine Translation: the Cloze Procedure Revisited'.

I have also briefly gone through the Apertium documentation and module
specifications. I have installed Apertium and am running it for the
Basque-English and Esperanto-English pairs.

I have the following observations/ideas.
The toolkit can provide the following options:

For the masking procedure:
1. An option to select what percentage of words should be masked.
2. An option to mask words randomly or at regular intervals.
3. Words may also be selected on the basis of their POS tags.
   The system may provide an option to select the distribution of POS tags
   used to mask words, e.g. 20% nouns/pronouns, 40% verbs, 40% adjectives.
   For this, we need to integrate Apertium's part-of-speech tagger with the
   toolkit.
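To make the masking ideas above concrete, here is a rough sketch of how
options 1 and 2 might look in Python. The function and parameter names are
my own assumptions, not code from the existing toolkit:

```python
import random

def mask_tokens(tokens, percent=20, mode="random", placeholder="____"):
    """Return a masked copy of `tokens` plus the hidden answers.

    `percent`% of the words are replaced by a cloze placeholder, with
    positions chosen either randomly or at regular intervals.
    """
    n_mask = max(1, round(len(tokens) * percent / 100))
    if mode == "random":
        positions = random.sample(range(len(tokens)), n_mask)
    else:  # "interval": spread the holes evenly over the sentence
        step = max(1, len(tokens) // n_mask)
        positions = list(range(0, len(tokens), step))[:n_mask]
    masked = list(tokens)
    answers = {}
    for i in positions:
        answers[i] = masked[i]   # remember the hidden word for scoring
        masked[i] = placeholder
    return masked, answers
```

For option 3, the same function could take a list of POS tags (one per
token, from Apertium's tagger) and restrict the candidate positions to the
requested tag distribution before sampling.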

For evaluation purposes, the system can either use a synonym list to look up
similar words (acceptable answers) or use binary (exact-match) evaluation.

For a proper name, figure, or date, it may be difficult for the user to
guess the correct word; these fields may be handled separately, in which
case a plausible but wrong guess may be accepted.
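A minimal sketch of this scoring scheme, assuming a hand-built synonym
dictionary and a partial-credit value of 0.5 (both of which are my own
illustrative choices, not settled design decisions):

```python
def score_answer(guess, reference, synonyms=None):
    """Score one cloze answer: exact match, or an acceptable near-synonym."""
    guess, reference = guess.strip().lower(), reference.strip().lower()
    if guess == reference:
        return 1.0   # exact-answer scoring, as in Somers and Wild
    if synonyms and guess in synonyms.get(reference, set()):
        return 0.5   # partial credit for an acceptable answer (assumed value)
    return 0.0

def evaluate(guesses, references, synonyms=None):
    """Return (% of holes filled correctly, % of holes left blank)."""
    scores = [score_answer(g, r, synonyms) for g, r in zip(guesses, references)]
    blanks = sum(1 for g in guesses if not g.strip())
    total = len(references)
    return 100 * sum(scores) / total, 100 * blanks / total
```

A binary evaluation is then just the same functions with `synonyms=None`.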

In their paper, Somers and Wild mention that "we feel confident that the
exact-answer scoring method is adequate, and that allowing near synonyms and
so on does not give a different result". I feel that in the case of gisting,
however, using 'acceptable answers' will be significant. This is also
reflected in the results reported in the 'Peeking through ...
on apertium.org' paper. Am I correct?

I still do not have a very clear idea of what counts as a 'correct' answer,
or of how to calculate an effective 'score' when comparing two machine
translation systems.

I am very interested in doing this project. How should I proceed further?

My language preference for this project is Python (flexible). I want to use
Python for both the text-based and web-based formats (using the web2py or
Django framework), as this will allow better maintenance and easier bug
fixing. (I have done projects in the web2py framework.)
If needed, I can develop the toolkit in Ruby on Rails too (I am familiar
with Ruby).

I tried to open the Apertium bugs page on Bugzilla (linked on the Ideas
page). The page shows an 'Internal Server Error'. Is there any other address
for the Apertium bug listing?

About Me:
I am pursuing a B.Tech + M.S. (by research) in Computer Science and
Engineering at the International Institute of Information Technology,
Hyderabad (IIIT-H), India. My M.S. research is with the Language Technology
Research Centre, IIIT-H. My research interests are Machine Translation,
Natural Language Processing, Artificial Intelligence, and Theoretical
Computer Science.

-Binay Neekhra
IRC nick: niks, binayneekhra
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
