9/5/2020

Hi,

I understand you have 2 text files (in the target language), both segmented into sentences (or turns), with aligned segments. To compute the classical BLEU, METEOR, NIST, etc. measures (NOT "metrics"!), you need the written form of the input speech, segmented in the same way. That could be difficult and costly, especially if you don't know the source language very well... which is likely if you need the translated speech.
But why do that? I further suppose you want to evaluate the (linguistic) quality of the MT output. But BLEU and the like do NOT measure linguistic quality, nor usage quality for that matter. Remember the famous paper (EACL 2006, Trento) by Chris Callison-Burch, Miles Osborne and Philipp Koehn, "Re-evaluating the Role of BLEU in MT Research", which demonstrates that point. Their conclusion was that BLEU does not and cannot measure "quality". A usage quality can be well defined (contrary to "adequacy") according to a given task (understanding, producing a professional translation, gisting...).

What I would do is measure the "post-edit usage quality" of the MT output. For that, I would compute (for each segment) a post-edit distance (combining character-based and word-based string distances) -- NOT a global similarity. Such a distance can be used to estimate the effort to post-edit the MT result (into the reference if you have one, or into the result after PE). In an experiment with a Moses-based French-Chinese MT system built by Lingxiao Wang, Haozhou Wang found (see his Master's project) that 1 unit of our post-edit distance corresponded to 2 seconds of manual post-editing. From that, one can estimate the time taken by PE per page (of 250 words, i.e. 1400-1500 characters in alphabetic languages, or 400-450 characters in ideographic languages), taking as a basis 1 hour per page without machine help. Haozhou is a PhD student at Unige; contact him for more details.

My proposal for getting a qualitative estimate, in the context of post-editing, is as follows:

 5 mn/page: excellent (18/20)
10 mn/page: very good (16/20)
15 mn/page: good (14/20)
20 mn/page: fair (12/20)
25 mn/page: just OK (10/20)
30 mn/page: not OK (8/20) -- less saving than the maximum saving one can get with a tool based on a good translation memory.
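To make the proposal concrete, here is a minimal sketch of such a computation. The source only specifies the calibration (1 distance unit ≈ 2 seconds of post-editing) and the page size (250 words) and grading scale; the equal 0.5 weighting between the character-level and word-level distances, and all function names, are my own assumptions, not the actual GETALP distance:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two sequences
    (works on strings for character level, on token lists for word level)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

def postedit_distance(mt, ref, alpha=0.5):
    """Combine character-based and word-based distances for one segment.
    alpha = 0.5 is an assumed weighting, not the one from the cited work."""
    char_d = levenshtein(mt, ref)
    word_d = levenshtein(mt.split(), ref.split())
    return alpha * char_d + (1 - alpha) * word_d

def minutes_per_page(distance_units, n_words, seconds_per_unit=2.0, page_words=250):
    """Scale per-segment effort (1 unit ~ 2 s, per the Wang experiment)
    to a standard 250-word page."""
    seconds = distance_units * seconds_per_unit
    return seconds / 60.0 * (page_words / max(n_words, 1))

# The qualitative scale from the proposal above (minutes/page -> grade).
GRADES = [(5, "excellent (18/20)"), (10, "very good (16/20)"),
          (15, "good (14/20)"), (20, "fair (12/20)"),
          (25, "just OK (10/20)")]

def grade(mn_per_page):
    for limit, label in GRADES:
        if mn_per_page <= limit:
            return label
    return "not OK (8/20)"
```

For example, a document whose segments accumulate 60 distance units over 250 words costs about 120 seconds, i.e. 2 minutes per page, which the scale above would rate "excellent (18/20)".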
Best regards,
Ch. Boitet

> On 8 May 2020, at 20:14, Vincent Vandeghinste <[email protected]> wrote:
>
> Dear MT'ers,
>
> Maybe some of you can answer the following question:
>
> I have a speech-recognition-based translation of a speech, with punctuation predictions etc.
>
> I have a sentence-based reference translation, one sentence per line.
>
> The sentence predictions of the speech translation system do not necessarily match the sentences of the reference file.
>
> How can I align my speech translation with the reference sentences so I can calculate BLEU scores and the like?
>
> Are there any scripts available for that? Or papers?
>
> Thank you,
>
> Kind regards,
>
> Vincent Vandeghinste
>
> _______________________________________________
> Mt-list site list
> [email protected]
> http://lists.eamt.org/mailman/listinfo/mt-list

-------------------------------------------------------------------------
Christian Boitet (Emeritus Professor, Université Grenoble Alpes)
Laboratoire d'Informatique de Grenoble (LIG)
Groupe d'Etude pour la Traduction Automatique et le Traitement Automatisé des Langues et de la Parole (GETALP)
Postal address:
GETALP, LIG-campus
Bâtiment IMAG, bureau 339
CS 40700
38058 Grenoble Cedex 9
France
