Hi,                             9/5/2020

I understand you have 2 text files (in the target language), both segmented into 
sentences (or turns), with aligned segments.
To compute the classical BLEU, METEOR, NIST, etc. measures (NOT "metrics"!), 
you need the written form of the input speech, segmented in the same way.
That could be difficult and costly, especially if you don't know the source 
language very well... which is likely if you need the translated speech.

But why do that?
I further suppose you want to evaluate the (linguistic) quality of the MT 
output. But BLEU and the like do NOT measure linguistic quality, nor usage 
quality for that matter.
Remember the famous paper (EACL 2006, Trento) by Chris Callison-Burch, Miles 
Osborne and Philipp Koehn, "Re-evaluating the Role of BLEU in MT Research", 
which demonstrates that point. Their conclusion was that BLEU does not and 
cannot measure "quality".
Usage quality, unlike "adequacy", can be well defined relative to a given task 
(understanding, producing a professional translation, gisting...). What I would 
do is measure the "post-edit usage quality" of the MT output.

For that, I would compute (for each segment) a post-edit distance (combining 
character-based and word-based string distances) -- NOT a global similarity. 
Such a distance can be used to estimate the effort needed to post-edit the MT 
result (into the reference if you have one, or into the result after PE).
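To make this concrete, here is a minimal Python sketch of such a combined 
distance. The equal weighting of the character-level and word-level components, 
and the length normalisation, are my own assumptions for illustration, not the 
exact scheme we used:

```python
def levenshtein(a, b):
    """Classic edit distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def postedit_distance(mt, ref):
    """Combine character- and word-based distances (equal weights assumed),
    each normalised by the length of the longer side, giving a value in [0, 1]."""
    char_d = levenshtein(mt, ref) / max(len(mt), len(ref), 1)
    mt_w, ref_w = mt.split(), ref.split()
    word_d = levenshtein(mt_w, ref_w) / max(len(mt_w), len(ref_w), 1)
    return 0.5 * char_d + 0.5 * word_d
```

Summing this per-segment distance over a document gives the total "post-edit 
effort" figure used below.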

In an experiment with a Moses-based French-Chinese MT system built by Lingxiao 
Wang, Haozhou Wang found (see his Master project) that 1 unit of our post-edit 
distance corresponded to 2 seconds of manual post-editing. From that, one can 
estimate the time taken by PE per page (of 250 words, or 1400-1500 characters 
in alphabetic languages, or 400-450 characters in ideographic languages), 
taking as basis 1 hour per page without machine help. Haozhou is a PhD student 
at UniGE; contact him for more details.
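As an illustration, the conversion from distance units to minutes per page can 
be sketched as follows. The 2 s/unit figure and the 250-word page are the 
values quoted above; the helper name is mine:

```python
SECONDS_PER_UNIT = 2   # empirical figure from Haozhou Wang's experiment
WORDS_PER_PAGE = 250   # standard page, alphabetic languages

def pe_minutes_per_page(total_distance_units, total_words):
    """Estimated post-editing time, in minutes per standard page."""
    seconds = total_distance_units * SECONDS_PER_UNIT
    pages = total_words / WORDS_PER_PAGE
    return (seconds / 60) / pages
```

For example, 300 distance units over a 1000-word document gives 2.5 minutes 
per page, to be compared with the 60 min/page baseline without machine help.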
My proposal for getting a qualitative estimate, in the context of post-editing, 
is as follows:

 5 min/page:    excellent (18/20)
10 min/page:    very good (16/20)
15 min/page:    good      (14/20)
20 min/page:    fair      (12/20)
25 min/page:    just OK   (10/20)
30 min/page:    not OK    ( 8/20) -- a smaller saving than the maximum one 
can obtain with a tool based on a good translation memory.
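The scale above can be applied mechanically. Here is a hypothetical helper 
encoding those thresholds (the function and its name are mine):

```python
# (threshold in min/page, label, mark out of 20), from the scale above
SCALE = [(5, "excellent", 18), (10, "very good", 16), (15, "good", 14),
         (20, "fair", 12), (25, "just OK", 10), (30, "not OK", 8)]

def grade(min_per_page):
    """Map a measured PE speed to (label, mark/20), rounding up to the next
    threshold; beyond 30 min/page, post-editing the MT output saves less
    than a good translation-memory tool would."""
    for threshold, label, mark in SCALE:
        if min_per_page <= threshold:
            return label, mark
    return "not worthwhile", None
```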

Best regards,

Ch.Boitet



> Le 8 mai 2020 à 20:14, Vincent Vandeghinste <[email protected]> a écrit 
> :
> 
> Dear MT'ers,
> 
> Maybe some of you can answer the following question:
> 
> I have a speech recognition based translation of a speech, with punctuation 
> predictions etc.
> 
> I have a sentence-based reference translation, one sentence per line.
> 
> The sentence predictions of the speech translation system do not necessarily 
> match the sentences of the reference file.
> 
> How can I align my speech translation with the reference sentences so I can 
> calculate BLEU scores and the like?
> 
> Are there any scripts available for that? or papers?
> 
> Thank you,
> 
> kind regards,
> 
> Vincent Vandeghinste
> 
> _______________________________________________
> Mt-list site list
> [email protected]
> http://lists.eamt.org/mailman/listinfo/mt-list

-------------------------------------------------------------------------
Christian Boitet
(Pr. émérite Université Grenoble Alpes)
Laboratoire d'Informatique de Grenoble
L             I               G
Groupe d'Etude pour la Traduction Automatique
                 et le Traitement Automatisé des Langues et de la Parole
G        E             T          A              L                P

--- Adresse postale ---
GETALP, LIG-campus
Bâtiment IMAG, bureau 339
CS 40700
38058 Grenoble Cedex 9
France           
