Hello all,
I am a B.E. student from China. My name is David Ho. I have been in contact
with fran on IRC and decided to write a proposal for the idea "Prototype
recursive transfer implementation". Here is my proposal.
It would be a great pleasure for me if any of you could have a look at it
and give me some advice. This is a draft, not the final version.

(P.S. If you would like a better reading experience, the proposal is also
available here: http://wiki.apertium.org/wiki/User:Davidho)

Contact information

Name: Junhao He

Email: [email protected]

IRC: Davidho

Why is it that you are interested in machine translation?

I am Chinese and have been learning English for more than 10 years and
Spanish for 2 years. However, when I encounter sentences or phrases in
English or Spanish that I cannot understand, none of the translation systems
available so far satisfies me. The longer I study foreign languages, the
better I understand how they differ from Chinese. I have always wanted to
create something that can handle Chinese translation appropriately, but I
also know it would be a huge project. It was not until I took a compiler
course last year that I learned how a translator works, and that was when I
became interested in machine translation.



Why is it that you are interested in the Apertium project?

The first time I came across Apertium was when I was reading the list of
accepted projects for GSoC 2013, and it was the Chinese-Spanish Apertium
system that attracted me. Before I knew about Apertium, I had no idea how to
start. Then I began reading the Apertium documentation and joined the IRC
channel #apertium. After doing some research on Apertium, I found three
characteristics of the system that impressed me. The first and most
important one is that Apertium is an open-source machine translation engine
that has been expanded to handle more divergent language pairs. It is well
designed and allows everyone to contribute to it. This ensures its
continuous growth and convinces me of its great prospects. Second, the
linguistic data files are encoded in XML-based formats. XML files are easy
to understand, which enables those with little linguistic knowledge to
expand the dictionaries. This helps to improve the quality of existing pairs
and to add new pairs.



Which of the published tasks are you interested in?

*Prototype recursive transfer implementation*

What do you plan to do?

*Before GSoC:* Help to improve the quality of the zho-spa language pair,
especially the transfer rules. I think this will help me understand the
transfer rules more deeply.

*Community bonding period:* Go through the Apertium documentation and get
more familiar with the system. Get in closer contact with the community.
Review finite-state dependency parsing and LALR(1) grammars.


*Week 1:* Propose a new formalism for transfer rules and discuss it with the
mentor.

*Week 2:* Continue proposing and discussing the new formalism with the
mentor, and write it up as formal documentation.

*Week 3:* Complete the documentation of the new format, and write a number
of transfer rules in the new formalism between Chinese and Spanish or
English.

*Week 4:* Continue to write transfer rules and list them in a well-organized
way.

*Deliverable 1:* Documentation of the new formalism and a list containing a
number of transfer rules.

*Week 5:* Rewrite the rules of the Chinese-Spanish pair using the new
formalism.

*Week 6:* Rewrite the rules of the Chinese-Spanish pair using the new
formalism.

*Week 7:* Rewrite the rules of the Chinese-Spanish pair using the new
formalism.

*Week 8:* Test, debug, and write documentation.

*Deliverable 2:* XML files of the zho-spa pair with rewritten transfer rules.

*Week 9:* Integrate the new rules with the Chinese-Spanish pair.

*Week 10:* Integrate the new rules with the Chinese-Spanish pair.

*Week 11:* Test and debug.

*Week 12:* Clean up and disseminate the results.

*Deliverable 3:* A full implementation of a prototype recursive transfer



Reasons why Google and Apertium should sponsor it

Apertium was designed to translate between closely related languages, where
translation does not involve much constituent reordering. However, as the
system develops, it is both inevitable and highly beneficial to expand it to
handle more divergent language pairs, for which reordering is a key concern.
This project aims to develop a prototype of a new module that can handle
long-distance reordering. It will be a big step forward for the whole system
if it succeeds. In addition, the zho-spa (Chinese-Spanish) pair was created
in GSoC 2013, but it has not been released because it does not meet the
quality requirements. I also understand that no one is currently working on
this pair, which is a real waste. These two languages are among the most
widely spoken languages in the world, so I am convinced that this pair is of
great value. However, the differences between them make the pair hard to
develop. I think that if I can eventually propose a new formalism for
transfer rules, it will help to reduce the difficulty of development.

A description of how and who it will benefit in society

Chinese is the most spoken language in the world by number of native
speakers, and the need to communicate with people from other countries is
growing rapidly. I believe that more and more people will become interested
in China and may want to learn Chinese. However, learning Chinese is not
easy. A system that can translate Chinese into other languages will be a
great help to everyone who wants to learn Chinese and to Chinese speakers
who want to communicate with foreigners.

List your skills and give evidence of your qualifications

I am a third-year undergraduate majoring in Software Engineering at South
China University of Technology.

I am proficient in C/C++ and have done several projects in these languages.
I can also use Python for small tasks. Last year I took courses on
Principles of Compilers and Formal Languages, and they are what got me
interested in Natural Language Processing. I believe that my knowledge of
parsers, syntax analyzers, finite automata, and finite-state transducers
will help me understand the Apertium system more deeply.

I can speak three languages: Chinese (mother tongue), English (fluent), and
Spanish (currently brushing up). These three languages come from three
different language systems, and I am sure that knowing the differences among
them will be of great help in proposing a new formalism for transfer rules.

I am currently working on implementing part of the functionality of a
columnar database. It involves parallel programming techniques such as
OpenMP, MPI, and pthreads. It is a huge project, and I have to collaborate
with other people over the Internet, so I am quite confident that I am
capable of completing programming work remotely.

List any non-Summer-of-Code plans you have for the Summer

Before my summer vacation begins, I will have final exams at the end of
June, which may last one week. Aside from that, I will focus on nothing but
the Apertium project.