Hello everyone,

This email gives some details about version 0.1.0 of the Apertium MT
system for Maltese to Arabic, which has just been released.
apertium-mt-ar itself is in the staging/ directory. The system is
partially based on the Maltese->Hebrew pair; it was developed this
summer as a Google Summer of Code project under the mentorship of
Kevin Brubeck Unhammer and Francis Tyers.

Some statistics:

- Number of entries in dictionaries:
-- Maltese monolingual: 7154
-- bilingual: 7685
-- Arabic monolingual: 6220

- Rules:
-- disambiguation: 29
-- transfer: 163 (chunker) + 7 (interchunk)

- Coverage (Maltese monolingual):
-- news corpus: 84.54 % (999722 known words, 1182521 tokenised words)
-- Wikipedia: 82.47 % (780288 known, 946197 tokenised)
-- Scannell corpus: 84.27 % (8587965 known, 10191487 tokenised)
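
The coverage percentages above are the share of tokenised words that
the Maltese analyser recognises. A minimal sketch of that arithmetic in
Python, using the counts listed above; the function name `coverage` is
only illustrative and is not part of Apertium:

```python
def coverage(known: int, tokenised: int) -> float:
    """Return the percentage of tokenised words known to the analyser."""
    if tokenised <= 0:
        raise ValueError("tokenised must be positive")
    return 100.0 * known / tokenised


# Counts copied from the statistics above: (known words, tokenised words).
corpora = {
    "news corpus": (999722, 1182521),
    "Wikipedia": (780288, 946197),
    "Scannell corpus": (8587965, 10191487),
}

for name, (known, tokenised) in corpora.items():
    print(f"{name}: {coverage(known, tokenised):.2f} %")
```

Run as is, this should reproduce the three percentages in the list.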

Evaluation was done regularly as the project went on, using four texts
taken from the Maltese Wikipedia (two of 200 words and two of 500
words). WER ranged from 8.70 % to 23.11 %, with no unknown words:
8.70 % (200 words), 23.11 % (500 words), 17.28 % (200 words) and
21.34 % (500 words). The preliminary evaluation, which covered what
had been done before Google Summer of Code started, gave a better
result of 3.17 % WER, but it used a very simple story of 300 words.
The evaluation texts can be found in the dev/story/ subdirectory.
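
For readers unfamiliar with the metric: WER (word error rate) is
commonly computed as the word-level edit distance between the MT
output and the post-edited reference, divided by the number of words
in the reference. The sketch below only illustrates that definition.
It is not the script used to obtain the figures above, and the
function name and the example sentences are made up for illustration.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance over reference length, in percent."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the hypothesis words
    # processed so far and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            current.append(min(previous[j] + 1,          # drop h
                               current[j - 1] + 1,       # add r
                               previous[j - 1] + cost))  # keep or replace
        previous = current
    return 100.0 * previous[-1] / len(ref)


# Invented toy sentences: one substituted word out of five.
mt_output = "the cat sat on mat"
post_edited = "the cat sat on mats"
print(f"{word_error_rate(mt_output, post_edited):.2f} %")
```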

We compared the results with Google: the evaluation texts were
translated with Google and post-edited. The WER figures obtained this
way are higher than the respective apertium-mt-ar results (20.94 %
Google against 8.70 % apertium-mt-ar, 33.53 % against 23.11 %, 35.89 %
against 17.28 %, 39.00 % against 21.34 %, and 47.44 % against 3.17 %
for the simple story). It should be noted, though, that all the
evaluation texts were used in the development of apertium-mt-ar.

Of course, Google is better at translating whole phrases elegantly,
and also at handling issues such as definiteness/indefiniteness in
the output, which is a big problem in apertium-mt-ar translations.
The most striking problems in the Google translations are
grammatical: for example, impersonal constructions are often used
where personal forms are expected, and incorrect personal verb forms
are also frequent. This is where apertium-mt-ar performs better.


Maltese->Arabic seemed a promising pair, because Maltese is a dialect
of Arabic, although one greatly influenced by Italian and English. But
the two languages are not as similar as I first thought: I
underestimated the differences between the Arabic dialects and
Standard Arabic, especially when it comes to syntax. Much work on
transfer rules is still needed, which is why I asked my mentors to
move the pair to staging/ rather than to trunk. The current release is
an early one.

Hopefully the Arabic->Maltese direction will be available one day as
well. The foundations for this are laid: basic transfer rules are
written, and at the moment both Maltese->Arabic and Arabic->Maltese
are testvoc clean. The main issues for Arabic->Maltese are Arabic
disambiguation, further development of the Arabic analyser (which was
written from scratch) and further development of the transfer rules.

Thank you for reading.

Best regards,
Maria Fronczak
