Christian,
Thanks much for your clarification statement. Yes, it helps. That is a very good set of statistics from your project.
Another set of statistics in:
GUERRA, Lorena. 2003. Human Translation versus Machine Translation and Full Post-Editing of Raw Machine Translation Output. Master's Thesis. Dublin City University.
http://www.geocities.com/mtpostediting/lorena-guerra-masters.pdf
Also:
ALLEN, Jeff. 2004. Case Study: Implementing MT for the Translation of Pre-sales Marketing and Post-sales Software Deployment Documentation. pp. 1-6. In Robert E. Frederking, Kathryn Taylor (Eds.): Machine Translation: From Real Users to Research, 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, USA, September 28 - October 2, 2004, Proceedings. Lecture Notes in Computer Science 3265. Springer, 2004. ISBN 3-540-23300-8.
http://www.informatik.uni-trier.de/~ley/db/conf/amta/amta2004.html#Allen04
This last one is a case study of telecom project documentation based on non-controlled input texts. No doctored texts. Everything was carefully logged. All working drafts of production documents were saved and versioned at regular intervals. The productivity gains are completely provable in just a few minutes from all the logged and archived information. The texts cannot be distributed because they contain customer-sensitive project information for multi-million dollar projects. But for people who want to meet me in person, well, that could be discussed offline....
I'm currently working on another dictionary-building project based on marketing texts (an entire potential corpus of a few hundred thousand words of text) with written permission of the publisher. Everything is being very carefully logged so as to produce another MT case study from it. The initial statistics are already very encouraging.
Christian, thanks for sharing your results with us.
Jeff
Jeff Allen http://www.geocities.com/jeffallenpubs/ [EMAIL PROTECTED] OR [EMAIL PROTECTED]
------------------
From: Christian Boitet <[EMAIL PROTECTED]>
To: "Jeff Allen" <[EMAIL PROTECTED]>, [EMAIL PROTECTED], [EMAIL PROTECTED], [email protected]
Subject: Clarification: MT Italian > English, Romanian<-> Italian
Date: Tue, 29 Mar 2005 15:09:25 +0200
Dear all, 29/3/05
At 11:28 +0000 29/03/05, Jeff Allen wrote:
Dear Natalia, Hermann, Christian, and all,
The following short article shows that Full Postediting for high-quality professional translation work is very much possible.
ALLEN, Jeff. What is Post-editing? Translation Automation Newsletter, Issue 4. February 2005. Published by Cross-Language.
http://www.geocities.com/mtpostediting/TA_IssueFour.pdf
I agree 100% with his article. I am even more optimistic than Jeff (and others, read it!).
Hence, I feel I must have been unclear in my preceding e-mail. Let me clarify my point.
Clarification
-------------
The negative part of what I said is that, WITHOUT UNDERSTANDING THE SOURCE LANGUAGE, or equivalently (if you do understand it) WITHOUT ACCESS TO THE SOURCE TEXT, or with VERY LIMITED ACCESS (say, 1 sentence per page, as when one postedits a quality human translation draft), it is not possible.
That is because too many errors have causes that one cannot possibly understand.
I have tested that for 3 years with engineer students. Very often, they don't even GUESS that something is completely wrong with a translation when they don't understand the source language.
That happens even if the MT system is set to produce more than one translation in case of lexical ambiguity (2 or 3, which is the default for Promt/Reverso on the web, contrary to Systran, although Systran Pro also has that parameter available). That is because:
- in case of lexical ambiguity, there may be many more possible translations than 2 or 3;
- many lexical errors come from unrecognized idiomatic usages (e.g., expressions with support verbs, like "connaître un développement" --> "to know a development");
- many other errors come from structural (attachment, scope) or functional (syntactic functions, semantic roles) ambiguities;
- and a lot of other errors come from the inability of the parser to produce a full, legitimate parse.
Now the positive part (in total agreement with Jeff's view).
The SAME MT system can be useless to produce a good quality translation if you don't know the source language (situation above), or can be a great help to that effect if you know the source language (situation below).
An experiment at ATR
--------------------
I have recently measured my translation time (English > French) using simply Excel to store, side by side, the original, the translation, and the MT proposal (Systran Pro with adequate parameter settings), one line per "polyphrase", with the translation column initially containing the MT proposal. I did this on 510 sentences of the BTEC corpus (about 12 pages of 250 words).
Result: 12-13 min per page of 250 words using MT
------  59 min per page if not using it (average performance of 3 other French natives also using Excel: 2 students and a senior researcher).
==> Time was divided by almost 5 when using MT.
==> That is better than Jeff's estimate, probably because the average length of sentences in this corpus is slightly more than 6 words.
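The speedup above can be checked with a quick calculation (a sketch using only the figures quoted in this message; the 12.5 min/page value is the midpoint of the reported 12-13 min range):

```python
# Quick check of the reported speedup, using the figures quoted above.
mt_min_per_page = 12.5     # midpoint of the 12-13 min/page reported with MT
manual_min_per_page = 59   # average of the 3 reference translators without MT

speedup = manual_min_per_page / mt_min_per_page
print(round(speedup, 2))   # 4.72, i.e. "time was divided by almost 5"
```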
Remarks:
-------
1) When comparing the 4 translations side by side, all were very good. 3 were very similar, and the one which stuck out was not mine (done with MT), but that of a student who used the "TU" form (as in Canada) rather than the "VOUS" form in travel dialogues.
2) I used "as is" 25% of the sentences (= I edited or fully retyped 75% of them).
3) I confirmed that output rate on 3000 more sentences (without asking others to produce other reference translations).
There are at least 4 reasons for this usefulness:
- reading the source first, you don't lose time (and your temper) futilely trying to understand a bad output;
- having 25% perfectly good translations already reduces the time by 25%;
- in most cases, even a very bad MT output contains usable translations of words or terms;
- last but quite important: you can perform GLOBAL corrections, either downwards or on the whole translation.
For example, Systran would translate "please" by "s.v.p.", which is not acceptable if the output is meant to be a transcription of a spoken utterance. "please" occurs maybe in 20-30% of the sentences. With 510 sentences, 1 such global change saves 10-15 local changes. In the whole BTEC (163000 sentences), it can save up to 8000 changes.
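The "global correction" idea above can be sketched in a few lines: one substitution is applied across all stored MT outputs at once, replacing many local edits with a single global change. The example sentences and the s.v.p. --> "s'il vous plaît" replacement are illustrative, not taken from the BTEC data:

```python
# Sketch of a global correction across MT outputs: one substitution
# applied to every sentence, instead of many individual local edits.
mt_outputs = [
    "Ouvrez la fenêtre, s.v.p.",
    "Un café, s.v.p.",
    "Attendez ici.",
]

# Replace the unwanted abbreviation everywhere in one pass.
fixed = [s.replace("s.v.p.", "s'il vous plaît") for s in mt_outputs]

# Count how many local edits this single global change replaced.
edits_saved = sum("s.v.p." in s for s in mt_outputs)
print(edits_saved)  # 2 local edits replaced by 1 global change
```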
I hope this clarification has been useful.
Best regards,
Ch.Boitet
_______________________________________________ Mt-list mailing list
