Note the paper submission deadline: 30 November 2024

Workshop website: https://comparable.lisn.upsaclay.fr/bucc2025/

COLING website: https://coling2025.org/

Keynote speaker: Preslav Nakov, Mohamed bin Zayed University of Artificial
Intelligence, Abu Dhabi

**************************************************************

* Motivation

In the language engineering and linguistics communities, research on
comparable corpora has had two main motivations. In language
engineering, on the one hand, it is chiefly motivated by the need to
use comparable corpora as training data for statistical NLP
applications such as statistical and neural machine translation or
cross-lingual retrieval. In linguistics, on the other hand, comparable
corpora are of interest because they enable cross-language discoveries
and comparisons. It is generally accepted in both communities that
comparable corpora consist of documents that are comparable in content
and form in various degrees and dimensions across several
languages. Parallel corpora are at one end of this spectrum, and
unrelated corpora are at the other.

In recent years, the use of comparable corpora for pre-training Large
Language Models (LLMs) has led to their impressive multilingual and
cross-lingual abilities, which are relevant to a range of applications,
including information retrieval, machine translation, and cross-lingual
text classification. The linguistic definitions and observations related
to comparable corpora can help improve methods for mining such corpora or
for cross-lingual transfer in LLMs. Therefore, it is of great interest
to bring together builders and users of such corpora.

* Shared Task
This year we will run a shared task aimed at detecting translations of
terms via comparable corpora. Please see the website for details:
https://comparable.limsi.fr/bucc2025/bucc2025-task.html


* Topics
We solicit contributions on all topics related to comparable (and parallel) 
corpora, including but not limited to the following:

Building Comparable Corpora:
- Automatic and semi-automatic methods
- Methods to mine parallel and non-parallel corpora from the web
- Tools and criteria to evaluate the comparability of corpora
- Parallel vs non-parallel corpora, monolingual corpora
- Rare and minority languages, across language families
- Multi-media/multi-modal comparable corpora


Applications of Comparable Corpora:
- Human translation
- Language learning
- Cross-language information retrieval & document categorization
- Bilingual and multilingual projections
- (Unsupervised) machine translation
- Writing assistance
- Machine learning techniques using comparable corpora


Mining from Comparable Corpora:
- Cross-language distributional semantics, word embeddings and
  pre-trained multilingual transformer models
- Extraction of parallel segments or paraphrases from comparable corpora
- Methods to derive parallel from non-parallel corpora (e.g. to support
  low-resource languages in neural machine translation)
- Extraction of bilingual and multilingual translations of single words,
  multi-word expressions, proper names, named entities, sentences,
  paraphrases, etc. from comparable corpora
- Induction of morphological, grammatical, and translation rules from
  comparable corpora
- Induction of multilingual word classes from comparable corpora


Comparable Corpora in the Humanities:
- Comparing linguistic phenomena across languages in contrastive
  linguistics
- Analyzing properties of translated language in translation studies
- Studying language change over time in diachronic linguistics
- Assigning texts to authors via authors' corpora in forensic
  linguistics
- Comparing rhetorical features in discourse analysis
- Studying cultural differences in sociolinguistics
- Analyzing language universals in typological research


* Workshop Organizers
- Serge Sharoff (University of Leeds)
- Ayla Rigouts Terryn (Université de Montréal (UdeM), Mila)
- Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, Orsay)
- Reinhard Rapp (University of Mainz, Germany)


* Program Committee
- Ebrahim Ansari (Institute for Advanced Studies in Basic Sciences,
  Iran)
- Eleftherios Avramidis (DFKI, Germany)
- Gabriel Bernier-Colborne (National Research Council, Canada)
- Thierry Etchegoyhen (Vicomtech, Spain)
- Alex Fraser (University of Munich, Germany)
- Natalia Grabar (University of Lille, France)
- Amal Haddad Haddad (Universidad de Granada, Spain)
- Amir Hazem (University of Tokyo, Japan)
- Kyo Kageura (University of Tokyo, Japan)
- Natalie Kübler (Université Paris Cité, France)
- Philippe Langlais (Université de Montréal, Canada)
- Yves Lepage (Waseda University, Japan)
- Shervin Malmasi (Amazon, USA)
- Michael Mohler (Language Computer Corporation, USA)
- Emmanuel Morin (Nantes Université, France)
- Dragos Stefan Munteanu (RWS, USA)
- Ted Pedersen (University of Minnesota, Duluth, USA)
- Nasredine Semmar (CEA LIST, Paris, France)
- Silvia Severini (Leonardo Labs, Italy)
- Pranaydeep Singh (University of Gent, Belgium)
- Richard Sproat (Google, USA)
- Marko Tadić (University of Zagreb, Croatia)
- François Yvon (Sorbonne Université, France)